The primary aim of this project was to build an automated model and web scraper that, each day, updates its parameters to reflect the most recent results and scrapes various betting markets in search of opportunities. I am not a bettor, nor have I ever placed a bet based on this project! I chose sports markets because they offer an interesting and dynamic source of live data, and in particular one where my domain knowledge is not as lacking as it would be in, say, finance or healthcare!
My motivation for pursuing this project was a combination of the statistical and the technical challenge: I wanted to apply what I learned in my degree to a messy real-world statistical problem, and to improve my technical skillset. In fact, the technical challenges I faced were far greater than I first imagined!
This project contains an automated betting odds scraper for NBA matches. It scrapes multiple bookies and returns the best price for each outcome. The code also checks for arbitrage (the situation where risk-free profit can be made by placing opposing bets with different bookies), performs changepoint monitoring using the Adaptive Forgetting Factor scheme from Bodenham and Adams (2016), and monitors discrepancies between the odds and modelled win probabilities.
The chosen model was the well-known Bradley-Terry model, trained on last season's match results, scraped from ESPN. The model updates dynamically each day to incorporate the previous day's results.
The main components of the project are:
1. Scraping the odds from multiple bookies at regular intervals. When matches are live, the scrape frequency increases.
2. Monitoring the best odds at each time period, and detecting any arbitrages, sudden changes in the odds (for live matches), or discrepancies with modelled probabilities.
3. Modelling and predicting the outcome of each match, updating automatically to include the results of the previous day. This provides a reference point to compare odds against.
4. Notifying me of anything that comes up in monitoring, via an automated email system.
The project is hosted on a DigitalOcean droplet running Ubuntu 22.04 (Linux). This allows the automated components to be scheduled on a cron-style timetable via Process Manager 2 (PM2).
To model each match, I chose a Bradley-Terry model. It is trained on head-to-head data, which I scraped from last season, and predicts outcomes by assigning each team i a ‘strength’ p_i. It also allows for other effects, for example home advantage. I chose only last season's data because I felt the two prior COVID-affected seasons were unreflective of a regular season, and anything before that was too far in the past. As the model retrains on new data as the season progresses, I felt this was justified.
The base of the model is the probability that team i beats team j: \[ \Pr\lbrace i \text{ beats } j \rbrace = \frac{p_i}{p_i + p_j} \] From that we derive a logistic model (where \(p_i = \exp(\beta_i)\)):
\[ \operatorname{logit}(\Pr\lbrace i \text{ beats } j \rbrace) = \log \left(\frac{\Pr\lbrace i \text{ beats } j \rbrace}{\Pr\lbrace j \text{ beats } i \rbrace} \right) = \log \left(\frac{p_i}{p_j} \right) = \beta_i - \beta_j \] Adding home advantage, \(\alpha\):
\[ \operatorname{logit}(\Pr\lbrace i \text{ beats } j \rbrace) = \alpha + \beta_i - \beta_j \]
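As a small sketch of the logistic form above (the function name and values are my own illustration, not the project's actual code), the win probability follows directly from the linear predictor:

```python
import math

def win_prob(beta_home: float, beta_away: float, alpha: float = 0.0) -> float:
    """P(home beats away) under the Bradley-Terry logistic model,
    with an optional home-advantage term alpha."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta_home - beta_away)))

# Equal strengths and no home advantage give a 50/50 match
print(win_prob(0.0, 0.0))        # 0.5
# A positive alpha tilts the same matchup toward the home team
print(win_prob(0.0, 0.0, 0.3))   # ~0.574
```

Note that the model is automatically coherent: for any strengths, the probabilities of the two outcomes sum to exactly 1.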
I chose this model because it is well known for sports events, and I want to use it as a baseline for analysing the efficiency of the betting markets. In future I would like to extend this project to include the Bradley-Terry-Luce and Thurstone-Mosteller models. I would also like to include BTDecay, a Bradley-Terry variant that places heavier weight on more recent results, akin to placing a high value on current ‘form’. In each case, I will also compare these against raw win frequency as a control.
Below is a plot of each team's Bradley-Terry ‘strength’ against win frequency. As we can see, the two are quite closely correlated (~95%); the differences arise from how teams perform against opponents of different strengths, i.e. the same win frequency yields a higher BT strength if the wins came against stronger opponents. The Phoenix Suns were far and away the ‘strongest’ team head to head based on last season's data, despite not making it to the Finals.
Below is an example of what the results dataset looks like in R. The variable ‘result’ indicates a home-team win and is the target to which our logistic model is fit.
The data:
While home advantage is a well-known phenomenon in sport, it is still important to statistically justify including it in our Bradley-Terry model. To do this, I performed a chi-squared likelihood ratio test comparing a model with home advantage to one without. The output is contained below; we can clearly reject the null hypothesis of no home advantage, as the p-value is significant at the 1% level.
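The mechanics of the test are simple to reproduce. Below is a minimal sketch (the log-likelihood values are placeholders for illustration, not the project's actual numbers, and the survival function is hand-rolled rather than taken from R's output): the two models differ by one parameter (alpha), so twice the log-likelihood gap is compared to a chi-squared distribution with 1 degree of freedom.

```python
import math

def chi2_sf_df1(x: float) -> float:
    """Survival function of the chi-squared distribution with 1 df.
    If X ~ chi^2_1 then X = Z^2 for standard normal Z,
    so P(X > x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

def lr_test(loglik_null: float, loglik_alt: float):
    """Likelihood ratio test for one extra parameter (here, alpha)."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2_sf_df1(stat)

# Placeholder log-likelihoods, for illustration only
stat, p = lr_test(loglik_null=-820.0, loglik_alt=-812.5)
print(stat, p)  # stat = 15.0, p << 0.01 -> reject 'no home advantage'
```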
Efficient betting markets should reflect all available information, so sports odds should correspond to accurate implied probabilities for each outcome (inflated slightly by the bookie's margin). However, if bookies set their odds based on betting flow so as to hold a neutral position (not long or short any given outcome), there may exist situations where bookies' odds deviate from the true probabilities. There may also be occasional disagreement among bookies on the odds offered, and a well-equipped bettor can potentially profit from an arbitrage bet in this situation, which has positive expected value irrespective of the outcome. This is equivalent to the implied probabilities of the odds across all outcomes summing to less than 1.
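To make the arbitrage condition concrete, here is a minimal sketch (the function names are my own, not the project's): take the best decimal odds on each outcome, sum the implied probabilities, and if the sum is below 1 split the bankroll in proportion to each implied probability, so that the payout is identical whichever outcome occurs.

```python
def implied_total(best_odds):
    """Sum of implied probabilities (1/odds) across all outcomes."""
    return sum(1.0 / o for o in best_odds)

def arbitrage_stakes(best_odds, bankroll=100.0):
    """If an arbitrage exists (implied total < 1), return the stake to place
    on each outcome; the payout bankroll/total is then the same whichever
    outcome occurs. Returns None when no risk-free profit is available."""
    total = implied_total(best_odds)
    if total >= 1.0:
        return None
    return [bankroll * (1.0 / o) / total for o in best_odds]

# Best home odds 2.10 at one bookie, best away odds 2.10 at another
odds = [2.10, 2.10]
print(implied_total(odds))     # ~0.952 < 1 -> arbitrage exists
print(arbitrage_stakes(odds))  # [50.0, 50.0]; payout is 105 either way
```

With a 100-unit bankroll, both outcomes pay 50 × 2.10 = 105, a guaranteed ~5% return; in practice such gaps are rare and close quickly.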
To investigate these phenomena, I needed to build a process that automatically scraped the odds from multiple bookies at regular intervals. I also needed to scrape the most recent match results automatically and update my model to reflect this new information. This process should then check for arbitrage opportunities and for discrepancies between the modelled and implied probabilities. On the recommendation of my statistics professor, I also included changepoint monitoring for the stream of odds from live matches, using the Adaptive Forgetting Factor algorithm from Bodenham and Adams (2016). This has potential use if one or more of the bookies are slow to react to an event in a live game, although at the latency I'm working at, I doubt I stand much chance of catching such an event! Nonetheless, it is an interesting component to add to the project.
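To give a flavour of the forgetting idea (this is a simplified, fixed-forgetting-factor sketch of my own, for illustration only; the actual scheme of Bodenham and Adams (2016) adapts the forgetting factor online and is implemented via an R package): keep an exponentially-weighted running mean and variance of the odds stream, and flag any observation that deviates sharply from the running mean.

```python
class ForgettingMonitor:
    """Exponentially-weighted running mean/variance with a fixed
    forgetting factor lam; a simplification of the adaptive scheme."""

    def __init__(self, lam=0.95, threshold=4.0):
        self.lam = lam            # forgetting factor: weight on the past
        self.threshold = threshold
        self.w = 0.0              # effective sample size
        self.mean = 0.0
        self.m2 = 0.0             # weighted sum of squared deviations

    def update(self, x):
        """Return True if x looks like a changepoint, then absorb it."""
        flagged = False
        if self.w > 1.0:
            var = self.m2 / self.w
            if var > 0 and abs(x - self.mean) > self.threshold * var ** 0.5:
                flagged = True
        # exponentially-weighted (Welford-style) updates
        self.w = self.lam * self.w + 1.0
        delta = x - self.mean
        self.mean += delta / self.w
        self.m2 = self.lam * self.m2 + delta * (x - self.mean)
        return flagged

# Stable odds around 2.0, then a sudden jump the monitor should flag
monitor = ForgettingMonitor()
stream = [2.0, 2.1, 1.9] * 10 + [3.5]
flags = [monitor.update(x) for x in stream]
print(flags[-1])  # True: the jump to 3.5 is flagged
```

The forgetting factor controls how quickly old odds are discounted, which matters in live games where the "normal" level itself drifts as the match unfolds.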
I chose Python over R as it offered the most flexibility for scripting; for the AFF algorithm and the Bradley-Terry model, however, I used R packages, integrated into the Python scripts with the rpy2 package. I used the selenium package for web scraping, as it allowed me to click through various buttons/pages where necessary. It also allowed for customisation to make my web scraper seem more ‘human’. Many of the websites I chose do not want to be scraped, for obvious reasons, so there were plenty of challenges dealing with bot firewalls. (As an aside, it is neither illegal nor against the terms of service of any of the websites in this project to scrape them. For some, however, it is against the terms of service to place bets based on this information. I have not done this, nor do I intend to.)
I chose to host the project on a DigitalOcean droplet running Ubuntu 22.04. I had never encountered a Linux OS before, nor had I ever used the SSH protocol or anything of the sort, so this was a steep learning curve, and it took quite some time to port scripts that worked on my local machine onto the virtual machine. Once there, I often ran into new challenges, such as the cloud IP address being blocked by some websites, or headless selenium browsers being blocked, etc. I learnt a lot about IP addresses, user agents, and even how to combine a selenium web scraper with a Tor relay to select any specific exit node (which I ultimately chose not to use in this project; see my other repo here). To automate the process, I used cron-style scheduling via Process Manager 2 (PM2). The scraper runs every 15 minutes early in the day, before games begin, and every 2 minutes thereafter. After each pass of the scraper, the ‘processing’ script runs: it reads in the odds, finds the best price for each team, checks for arbitrages, compares each set of odds to our Bradley-Terry win probabilities to flag modelling discrepancies, and checks for changepoints in live games. When any of these are discovered, the script sends me a brief email with the relevant information. Each day at midnight Pacific Time, the previous day's results are scraped from ESPN and the Bradley-Terry strengths are recalculated.
If you’ve read even half of that, thank you, because you really didn’t have to! While I understand it’s not the most statistically or technically sophisticated bot out there, I really enjoyed making it and hope it demonstrates my willingness to learn on the fly and adapt to challenges. Hopefully by the next time you check back here I’ll have added some new features!